How to analyse scover outputΒΆ
Introduction
This vignette focusses on downstream analysis of the output from scover. After running scover, it outputs different types of information. The most important are:
- The alignments of the found convolutional kernels to a database of motifs, with significance (q-)scores for each alignment
- The influence scores of the convolutional kernels in neural network, representing the change in prediction when a particular motif is left out
The central part of scover trains 10 different neural network models, each using a different 90% of the dataset for training and the remaining 10% for testing. Each model contains (by default) 300 convolutional filters, so there are a total of 3,000 motifs. To ensure that the results are robust, scover seeks to identify motif clusters that show up repeatedly, therefor we require that a motif cluster should show up in at least half of the models.
The motifs are named as MODELNUMBER_MOTIFNUMBER, e.g.Β motif 2_3 is motif 3 from model 2. The best performing out of 10 models is named βbestβ rather than its model number, e.g.Β motif best_129.
All of the motifs are aligned with Tomtom, the motif comparison tool from the MEME Suite, against a database of known motifs. This generates a tab-separated file containing what motifs aligned to what reference motifs, and the q-values for the alignments. This result can be represented as a bipartite graph where the model motifs and the database motifs represent the two categories of nodes. Then it becomes possible to select reproducible motifs. For a visual overview, see the figure below.
One of the outputs from the model is a HDF5 file containing motif influence scores for each model. These are calculated for each convolutional filter in each pool - these scores represent the mean change in prediction for a particular pool when a convolutional filter is left out.
Below is a visual overview of the workflow for motif analysis using scover. The downstream analysis described in this section starts at the third step, where we have obtained influence scores from the models and the motif alignments from Tomtom.
Load packages
I use the following R packages during this guide (see the sessionInfo() at the end of this guide for an overview of all the loaded packages):
suppressMessages(library(rhdf5))
suppressMessages(library(mixtools))
suppressMessages(library(SingleCellExperiment))
suppressMessages(library(dplyr))
suppressMessages(library(magrittr))
suppressMessages(library(reshape2))
suppressMessages(library(igraph))
suppressMessages(library(stringr))
suppressMessages(library(ggplot2))
suppressMessages(library(ggthemes))
suppressMessages(library(gghalves))
suppressMessages(library(ggrepel))
suppressMessages(library(pheatmap))
suppressMessages(library(plotly))
suppressMessages(library(tidytext))
options(stringsAsFactors = F)
Iβve added some custom functions to the following file (find it in the resources folder of this repository):
source("scover_helper_functions.R")
Set paths
For the rest of the analysis, I am using paths to the network output (exp_folder), the input dataset (data_path), and the cell type annotation (colData_path). I also need to input the motif database that was used for motif alignment (motif_db_path). I will also create an output directory for the analysis.
# Network output dir
exp_folder <- "scover_output/"
# Network input datasets
data_path <- "data/20200825_combined_mature_fetal_kidney_pool120_5u5d.tsv.gz"
colData_path <- "data/20200825_combined_mature_fetal_kidney_pool120_5u5d_colData.tsv"
# Motif alignment db
motif_db_path <- "data/Homo_sapiens.meme"
# Output directory
outdir <- "analysis"
dir.create(outdir)
## Warning in dir.create(outdir): 'analysis' already exists
Read data
First, Iβll use the custom functions to read the data, tomtom output, and the network influence scores. Then Iβll filter the tomtom output to only include TFs that were expressed
sce <- scover_dataset_to_sce(data_path = data_path,
colData_path = colData_path)
tomtom_table <- read_tomtom_output(motif_db_path = motif_db_path,
exp_folder = exp_folder)
influence_scores <- read_network_hdf5(exp_folder, sce)
# Filter the tomtom table to only include expressed TFs
tomtom_table <- filter_tomtom_table(tomtom_table = tomtom_table, sce = sce)
Graph analysis
Now Iβll generate a motif alignment graph (as in the other guide) and plot the graph:
alignment_graph <- create_alignment_graph(tomtom_table)
# Cluster alignment graph
alignment_graph_clusters <- igraph::cluster_walktrap(alignment_graph)
alignment_graph_cluster_groups <- igraph::groups(alignment_graph_clusters)
# Plot two alignment graphs
plot_alignment_graph(outdir, alignment_graph, alignment_graph_clusters)
To get the motif clusters into a dataframe, I use a custom function:
cluster_df <- construct_cluster_df(exp_folder = exp_folder,
alignment_graph_cluster_groups =
alignment_graph_cluster_groups)
I also want to construct a matrix where the motif influence scores are averaged per cell type:
average_influence_matrix <- construct_average_influence_matrix(influence_scores)
I use this function to get information about the motifs in this matrix. This will also plot the motif annotation as βgoodβ or βbadβ:
motif_stats <- get_motif_stats(cluster_df = cluster_df,
average_influence_matrix = average_influence_matrix,
pseudocount = 1e-11)
## number of iterations= 211
## No clear bimodal distribution: likely not a lot of 'dead motifs'
The next step will select βgood motifsβ based on impact, motif cluster reproducibility, and motif cluster annotation (only the clusters that aligned to known motifs will be included in this analysis):
# Get names of good motifs: the ones that have high impact, are from
# reproducible motif clusters, and are aligned to known motifs
motif_stats <- motif_stats[motif_stats$cluster_annot != "NA",]
good_motifs <- rownames(motif_stats)[motif_stats$is_good_motif=="Yes" &
as.numeric(motif_stats$cluster_reprod) >= 0.5]
influence_scores <- influence_scores[good_motifs,]
motif_stats <- motif_stats[good_motifs,]
average_influence_matrix <- average_influence_matrix[good_motifs,]
Plotting the motif influence scores across cell types
If we were to plot the average influence scores of the motifs across the cell types, it would look something like this:
# Only use this function if you have 10 or fewer motif clusters at this point.
annotation_colors <- get_annotation_colors(motif_stats)
# Set colors for heatmaps here
heatmap_colors <- colorRampPalette(c("magenta", "black", "yellow"))(100)
# The "drop = FALSE" will make sure the annotation stays a dataframe
pheatmap(average_influence_matrix,
annotation_row = motif_stats[,c("cluster_annot"), drop = FALSE],
show_rownames = FALSE, angle_col = 45,
annotation_colors = annotation_colors,
cellwidth = 8,
cellheight = .4,
color = heatmap_colors)
Here, every row corresponds to a convolutional filter and every column corresponds to a cell type.
As you can see, a small subset of the motifs has the largest impact. If we instead plot Z-transformed motif influence scores, it looks like this:
# Use custom function to get
average_influence_matrix_z_scores <- to_z(average_influence_matrix)
pheatmap(average_influence_matrix_z_scores,
annotation_row=motif_stats[,c("cluster_annot"), drop=FALSE],
show_rownames = FALSE, angle_col = 45,
annotation_colors = annotation_colors,
cellwidth = 8,
cellheight = .4,
color = heatmap_colors)
Already you can see that the motif clusters cluster together much better. Do note that the relative impacts of motifs in a single cell type cannot be deduced from this heatmap, as the values have been transformed per row. What this does show is the general βtrendβ of the influence scores of a motif across the cell types.
Z-scores can, however, inflate the differences between cell types. This is why it is beneficial to have another score that tells you something about how βcell-type-specificβ a motif influence score really is. For this, the βmotif cell type entropyβ score has been calculated for each motif, which is the normalised Shannon entropy of a motif influence score across cell types. The lower this score is, the further from a uniform distribution the influence score distribution is.
Letβs plot the motifs along with the entropy scores. I will also order the motifs for their motif cluster annotation:
pheatmap(average_influence_matrix_z_scores[order(motif_stats$cluster_annot),],
annotation_row=motif_stats[,c("cluster_annot",
"motif_celltype_entropy")],
show_rownames = FALSE, angle_col = 45,
annotation_colors = annotation_colors,
cluster_rows = FALSE,
cellwidth = 8,
cellheight = .4,
color = heatmap_colors)
Better cell type annotation
To get an overview at the βcell type categoryβ level, rather than the individual cell type level, you can do some other types of analysis. For this, Iβve created (for this specific example) a named vector with the different cell types as names that contains the corresponding cell type category:
Category_table <- readRDS("data/Category_table.RDS")
print(head(Category_table))
## Fetal B cell Fetal Cap mesenchyme
## "Immune" "Nephron progenitor"
## Fetal CD4 T cell Fetal cDC2
## "Immune" "Immune"
## Fetal CNT/PC - proximal UB Fetal Distal renal vesicle
## "Nephron epithelium" "Nephron progenitor"
To add this information to the sce object, you can use this:
sce$Category <- Category_table[sce$cell_type1]
Refining motif cluster annotations
The current motif cluster annotation is not very specific, but this is a consequence of the fact that most motifs can be bound by multiple TFs making it impossible to tell which one is binding. Nevertheless, if expression information is available we can take advantage of it to identify the most likely candidate. The underlying assumption here is that TFs with a higher (absolute) correlation are more likely to be involved in the regulation. Please note that this is one of the more bespoke steps in this analysis and it is difficult to fully automate. For better control over the steps in this analysis, I would recommend looking at the corresponding functions in scover_helper_functions.R.
The following code will produce plots in the outdir that show the expression pattern for the different cluster candidates, including a Spearman correlation score for the different cluster candidate TFs. In addition, it will generate directories per motif cluster that contain the different plots that show how the expression and influence values are related. It will also return a DataFrame that contains the relevant information.
cluster_correlations_df <- plot_candidate_tf_information(cluster_df = cluster_df,
influence_matrix = influence_scores,
motif_stats = motif_stats,
sce = sce,
tomtom_table = tomtom_table)
##
## Attaching package: 'grid'
## The following object is masked from 'package:mixtools':
##
## depth
As an example we consider the cluster with YY1, YY2, and THAP1. One of the figures generated by the above function shows the expression levels of the different genes in the family across cell types:
YY1 has a higher expression than YY2, making it a more likely regulator than YY2. In addition, on the left we can see that YY1 and YY2 have a higher correlation to the motif weights (green) and more Tomtom alignments (purple) than THAP1, making one of them the more likely candidate.
Letβs take a closer look at the correlation plots for these two:
Here, the cell type categories are annotated by color.
We can now update motif cluster annotations using the correlation information. Sometimes it is hard to identify a single transcription factor, in which case I annotate the cluster with the motif family names. Sometimes it can be good to look at individual motifs in the motif cluster and look directly at the alignments in Tomtom; for this, see the all_tomtom/tomtom.html webpage generated by Tomtom.
I do this using the following code (I look at View(cluster_df[cluster_df$cluster_reproducibility >= 0.5,]) to see which number corresponds to which cluster):
cluster_df$cluster_annot[rownames(cluster_df) == "7"] <- "ETS motif family"
cluster_df$cluster_annot[rownames(cluster_df) == "11"] <- "ZBTB33/BRCA1"
cluster_df$cluster_annot[rownames(cluster_df) == "2"] <- "EGR/KLF motif families"
cluster_df$cluster_annot[rownames(cluster_df) == "9"] <- "YY1"
cluster_df$cluster_annot[rownames(cluster_df) == "3"] <- "bZIP motif family"
cluster_df$cluster_annot[rownames(cluster_df) == "1"] <- "ZFX"
# Update the annotation
motif_stats <- get_motif_stats(cluster_df = cluster_df,
average_influence_matrix = average_influence_matrix,
pseudocount = 1e-11)
## number of iterations= 254
## No clear bimodal distribution: likely not a lot of 'dead motifs'
annotation_colors <- get_annotation_colors(motif_stats)
pheatmap(average_influence_matrix_z_scores[order(motif_stats$cluster_annot),],
annotation_row=motif_stats[,c("cluster_annot",
"motif_celltype_entropy")],
show_rownames = FALSE, angle_col = 45,
annotation_colors = annotation_colors,
cluster_rows = FALSE,
cellwidth = 8,
cellheight = .4,
color = heatmap_colors)
Plotting all motif family correlations
It is helpful to be able to view the correlations for all TFs related to the motif families reported by the network. Using the following function, you can plot the correlations and expression values for the different motif families, while annotating the top x hits per motif family:
# First I update the motif family names:
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "7"] <- "ETS motif family"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "11"] <- "ZBTB33/BRCA1"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "2"] <- "EGR/KLF motif families"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "9"] <- "YY1"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "3"] <- "bZIP motif family"
cluster_correlations_df$cluster_annot[cluster_correlations_df$cluster == "1"] <- "ZFX"
plot_motif_family_correlation_overview(cluster_correlations_df, top_n = 5)
Aggregate influence scores for motif families
To get an idea of the influence score of a particular motif family, you can sum the influence scores of the motifs belonging to that family. Here, I cycle through the motif families and fill an empty matrix with the summed values of the corresponding motif influence scores:
# Create empty matrix
aggregate_motif_influences <- matrix(nrow=length(unique(motif_stats$cluster_annot)),
ncol=ncol(influence_scores))
# Cycle through motif families
for(i in 1:nrow(aggregate_motif_influences)){
current_motif_family <- unique(motif_stats$cluster_annot)[i]
curr_motifs <- rownames(motif_stats[motif_stats$cluster_annot == current_motif_family,])
current_motif_influences <- influence_scores[curr_motifs,]
aggregate_motif_influences[i,] <- colSums(current_motif_influences)
}
# Add column names: pool names
rownames(aggregate_motif_influences) <- unique(motif_stats$cluster_annot)
colnames(aggregate_motif_influences) <- colnames(influence_scores)
(With the same method, you can also derive motif family scores averaged per cell type category, for example.)
Now that we have these values, we can visualize them with ggplot. First I βmeltβ the dataframe, then I add cell type category information, and I use visualisation tools from the gghalves package.
aggregate_motif_influences_melt <- melt(aggregate_motif_influences)
colnames(aggregate_motif_influences_melt) <- c("motif_fam",
"pool",
"aggregate_influence")
aggregate_motif_influences_melt$celltype <- str_split_fixed(aggregate_motif_influences_melt$pool, "_", 2)[,1]
aggregate_motif_influences_melt$category <- Category_table[aggregate_motif_influences_melt$celltype]
ggplot(aggregate_motif_influences_melt, aes(x=category, y=aggregate_influence,
fill=category)) +
geom_half_boxplot() + geom_half_violin(side="r") + # from 'gghalves' package
theme_Nice() + # from 'scover_helper_functions.R'
scale_x_reordered() + # from 'tidytext' package
scale_fill_stata() + # from 'ggthemes' package
labs(x="Category", y="Aggregate motif influence") +
facet_wrap(~motif_fam, scales="free")
PCA plots based on motif influence scores or expression values
Now, I will do a PCA based on either:
- Motif influence scores across pools
- Expression values across pools
First, Iβll do the one based on motif influence scores:
pca_result <- prcomp(t(influence_scores))
# eigenvalues
eigs <- pca_result$sdev^2
variances_explained <- eigs/sum(sum(eigs))
pca_result <- as.data.frame(pca_result$x)
# See what the coordinates are:
print(pca_result[1:2,1:3])
## PC1 PC2 PC3
## Fetal B cell_pool1 -0.01976748 -0.03901322 0.002209844
## Fetal B cell_pool2 -0.01769229 -0.03388922 0.002284000
# Variances explained per PC
print(variances_explained[1:5])
## [1] 0.53261500 0.26351268 0.08142457 0.02166117 0.01295619
Iβll add cell type and cell type category annotation to this.
pca_result$Celltype <- colData(sce)[rownames(pca_result),]$cell_type1
pca_result$Category <- Category_table[pca_result$Celltype]
In this case, the dataset was made using two different datasets: fetal and mature kidney. Iβll add annotation for the origin of the pools this way (but this is specific to this analysis):
pca_result$Origin <- sapply(pca_result$Celltype, FUN=function(x){
str_split(x, " ")[[1]][1]
})
Now, Iβll plot the PCA plot, showing the cell type categories as colours, and the origin of the pools as the shapes:
# Function "scale_color_stata()" is from package ggthemes,
# and function "geom_label_repel()" is from package ggrepel.
ggplot(pca_result, aes(x=PC1, y=PC2, color=Category, shape=Origin)) + geom_point() +
scale_color_stata() + theme_Nice(angled=FALSE) + theme(legend.position = "right") +
labs(x=paste0("PC1 (", round(variances_explained[1]*100, 2), "%)"),
y=paste0("PC2 (", round(variances_explained[2]*100, 2), "%)")) +
coord_fixed()
Now, Iβll do exactly the same, but based on expression values:
pca_result_expression <- prcomp(t(logcounts(sce)))
eigs_expression <- pca_result_expression$sdev^2
variances_explained_expression <- eigs_expression/sum(sum(eigs_expression))
pca_result_expression <- as.data.frame(pca_result_expression$x)
# See what the coordinates are:
print(pca_result_expression[1:2,1:3])
## PC1 PC2 PC3
## Fetal B cell_pool1 -84.77157 -94.90995 -7.852795
## Fetal B cell_pool2 -86.36306 -91.49847 -6.077714
# Variances explained per PC
print(variances_explained_expression[1:5])
## [1] 0.38369206 0.14193260 0.03773392 0.03144243 0.02420017
pca_result_expression$Celltype <- sce$cell_type1
pca_result_expression$Category <- sce$Category
pca_result_expression$Origin <- sapply(pca_result_expression$Celltype,
FUN=function(x){
str_split(x, " ")[[1]][1]
})
ggplot(pca_result_expression, aes(x=PC1, y=PC2, color=Category, shape=Origin)) + geom_point() +
scale_color_stata() + theme_Nice(angled=FALSE) + theme(legend.position = "right") +
labs(x=paste0("PC1 (", round(variances_explained_expression[1]*100, 2), "%)"),
y=paste0("PC2 (", round(variances_explained_expression[2]*100, 2), "%)")) +
coord_fixed()
sessionInfo()
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] tidytext_0.2.6 plotly_4.9.2.1
## [3] pheatmap_1.0.12 ggrepel_0.8.2
## [5] gghalves_0.1.0 ggthemes_4.2.0
## [7] ggplot2_3.3.1 stringr_1.4.0
## [9] igraph_1.2.5 reshape2_1.4.4
## [11] magrittr_1.5 dplyr_1.0.0
## [13] SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.1
## [15] DelayedArray_0.12.3 BiocParallel_1.20.1
## [17] matrixStats_0.56.0 Biobase_2.46.0
## [19] GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
## [21] IRanges_2.20.2 S4Vectors_0.24.4
## [23] BiocGenerics_0.32.0 mixtools_1.2.0
## [25] rhdf5_2.30.1
##
## loaded via a namespace (and not attached):
## [1] viridis_0.5.1 httr_1.4.1 tidyr_1.1.0
## [4] jsonlite_1.6.1 viridisLite_0.3.0 splines_3.6.3
## [7] GenomeInfoDbData_1.2.2 yaml_2.2.1 pillar_1.4.4
## [10] lattice_0.20-40 glue_1.4.1 digest_0.6.25
## [13] RColorBrewer_1.1-2 XVector_0.26.0 colorspace_1.4-1
## [16] htmltools_0.4.0 Matrix_1.2-18 plyr_1.8.6
## [19] pkgconfig_2.0.3 zlibbioc_1.32.0 purrr_0.3.4
## [22] scales_1.1.1 tibble_3.0.1 generics_0.0.2
## [25] farver_2.0.3 ellipsis_0.3.1 withr_2.2.0
## [28] lazyeval_0.2.2 survival_3.1-8 crayon_1.3.4
## [31] evaluate_0.14 tokenizers_0.2.1 janeaustenr_0.1.5
## [34] MASS_7.3-51.5 SnowballC_0.7.0 segmented_1.1-0
## [37] tools_3.6.3 data.table_1.12.8 lifecycle_0.2.0
## [40] Rhdf5lib_1.8.0 kernlab_0.9-29 munsell_0.5.0
## [43] compiler_3.6.3 rlang_0.4.6 RCurl_1.98-1.2
## [46] htmlwidgets_1.5.1 bitops_1.0-6 labeling_0.3
## [49] rmarkdown_2.2 gtable_0.3.0 R6_2.4.1
## [52] gridExtra_2.3 knitr_1.28 stringi_1.4.6
## [55] Rcpp_1.0.4.6 vctrs_0.3.1 tidyselect_1.1.0
## [58] xfun_0.14